⚡ Bolt: [performance improvement] Optimize keyword delimiter matching#281
⚡ Bolt: [performance improvement] Optimize keyword delimiter matching#281bashandbone wants to merge 1 commit intomainfrom
Conversation
This commit modifies `_match_keyword_delimiters` in `src/codeweaver/engine/chunker/delimiter.py` to significantly improve chunking performance. Instead of calling `re.finditer` for every individual keyword delimiter, the optimization combines all start strings into a single compiled regex pattern. This reduces regex execution overhead and limits the algorithm to making a single pass over the content. Additionally, an early return checks for empty lists to prevent compiling a dangerous empty regex. Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
Reviewer's GuideOptimizes keyword delimiter matching in DelimiterChunker by replacing per-keyword regex scans with a single combined compiled regex pass, adding a guard for empty keyword sets, and fixing a minor docstring typo. Sequence diagram for optimized keyword delimiter matchingsequenceDiagram
actor Caller
participant DelimiterChunker
participant RegexEngine
participant DelimiterMatchList as Matches
activate DelimiterChunker
Caller->>DelimiterChunker: _match_keyword_delimiters(content, keyword_delimiters, matches)
DelimiterChunker->>DelimiterChunker: filter keyword_delimiters with nonempty start
DelimiterChunker->>Caller: return matches if keyword_delimiters is empty
deactivate DelimiterChunker
activate DelimiterChunker
DelimiterChunker->>DelimiterChunker: start_strings = [d.start for d in keyword_delimiters]
DelimiterChunker->>RegexEngine: compile combined_pattern with alternation
RegexEngine-->>DelimiterChunker: combined_pattern
DelimiterChunker->>DelimiterChunker: delimiter_map = {d.start: d}
loop for each match in combined_pattern.finditer(content)
DelimiterChunker->>RegexEngine: combined_pattern.finditer(content)
RegexEngine-->>DelimiterChunker: match
DelimiterChunker->>DelimiterChunker: matched_text = match.group(0)
DelimiterChunker->>DelimiterChunker: delimiter = delimiter_map[matched_text]
DelimiterChunker->>DelimiterChunker: keyword_pos = match.start()
DelimiterChunker->>DelimiterChunker: _is_inside_string_or_comment(content, keyword_pos)
alt inside string or comment
DelimiterChunker->>DelimiterChunker: continue to next match
else outside string and comment
DelimiterChunker->>DelimiterChunker: _find_next_structural_with_char(content, start, allowed)
DelimiterChunker-->>DelimiterChunker: struct_start, struct_char
alt no structural opening
DelimiterChunker->>DelimiterChunker: continue to next match
else structural opening found
DelimiterChunker->>DelimiterChunker: _find_matching_close(content, struct_start, struct_char, closing_char)
DelimiterChunker-->>DelimiterChunker: struct_end
alt struct_end is not None
DelimiterChunker->>DelimiterChunker: nesting_level = _calculate_nesting_level(content, keyword_pos)
DelimiterChunker->>Matches: append DelimiterMatch(delimiter, keyword_pos, struct_end, nesting_level)
else struct_end is None
DelimiterChunker->>DelimiterChunker: continue to next match
end
end
end
end
DelimiterChunker-->>Caller: matches
Updated class diagram for DelimiterChunker keyword delimiter matchingclassDiagram
class DelimiterChunker {
+_match_keyword_delimiters(content: str, keyword_delimiters: list[Delimiter], matches: list[DelimiterMatch]) list[DelimiterMatch]
-_is_inside_string_or_comment(content: str, position: int) bool
-_find_next_structural_with_char(content: str, start: int, allowed: set[str]) tuple[int,str]
-_find_matching_close(content: str, struct_start: int, struct_char: str, closing_char: str) int
-_calculate_nesting_level(content: str, position: int) int
}
class Delimiter {
+start: str
+end: str
+type: str
}
class DelimiterMatch {
+delimiter: Delimiter
+start_pos: int
+end_pos: int
+nesting_level: int
}
DelimiterChunker ..> Delimiter : uses
DelimiterChunker ..> DelimiterMatch : creates
note for DelimiterChunker "_match_keyword_delimiters now uses a single combined regex pattern and delimiter_map for keyword matching"
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- The new
delimiter_map = {d.start: d ...}assumes keywordstartstrings are unique; if multiple delimiters can share the same start token, this will silently drop all but one, whereas the previous per-delimiter loop would have produced matches for each. - The combined regex changes the ordering of
matchesfrom grouped-by-delimiter to strictly text order; if any downstream logic relies on the previous grouping semantics, consider either preserving the old order or documenting the change explicitly. - You can slightly simplify/optimize by computing
allowed = set(structural_pairs.keys())once before the regex loop and reusing it in_find_next_structural_with_charinstead of recreating the set on every match.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `delimiter_map = {d.start: d ...}` assumes keyword `start` strings are unique; if multiple delimiters can share the same start token, this will silently drop all but one, whereas the previous per-delimiter loop would have produced matches for each.
- The combined regex changes the ordering of `matches` from grouped-by-delimiter to strictly text order; if any downstream logic relies on the previous grouping semantics, consider either preserving the old order or documenting the change explicitly.
- You can slightly simplify/optimize by computing `allowed = set(structural_pairs.keys())` once before the regex loop and reusing it in `_find_next_structural_with_char` instead of recreating the set on every match.
## Individual Comments
### Comment 1
<location path="src/codeweaver/engine/chunker/delimiter.py" line_range="472" />
<code_context>
- for match in re.finditer(pattern, content):
- keyword_pos = match.start()
+ # Create a mapping to quickly look up the delimiter by its matched start string
+ delimiter_map = {d.start: d for d in keyword_delimiters}
- # Skip if keyword is inside a string or comment
</code_context>
<issue_to_address>
**issue (bug_risk):** Handling duplicate keyword start strings may change behavior compared to the previous implementation.
Previously, each delimiter was processed independently with `re.finditer`, so duplicate `start` values would each be applied (possibly with different end behavior/metadata). With `delimiter_map = {d.start: d for d in keyword_delimiters}`, only the last delimiter for a given `start` is ever used. If duplicate `start` values are valid in this context, this changes semantics. Consider either enforcing unique `start` values up front and failing fast, or mapping each `start` to a list of delimiters to preserve the prior behavior.
</issue_to_address>
### Comment 2
<location path="src/codeweaver/engine/chunker/delimiter.py" line_range="477" />
<code_context>
- struct_start, struct_char = self._find_next_structural_with_char(
- content,
- start=keyword_pos + len(delimiter.start),
- allowed=set(structural_pairs.keys()),
- )
+ # Skip if keyword is inside a string or comment
</code_context>
<issue_to_address>
**suggestion (performance):** The `allowed` set is rebuilt on every match and could be hoisted out of the loop.
This recreates `set(structural_pairs.keys())` for every keyword occurrence. Since `structural_pairs` is constant in this method, compute `allowed_structurals = set(structural_pairs.keys())` once before `for match in combined_pattern.finditer(content):` and pass `allowed=allowed_structurals` here to avoid repeated allocation in the hot inner loop.
Suggested implementation:
```python
allowed_structurals = set(structural_pairs.keys())
for match in combined_pattern.finditer(content):
matched_text = match.group(0)
delimiter = delimiter_map[matched_text]
keyword_pos = match.start()
```
Within the same method where this loop lives, find the call(s) to:
`self._find_next_structural_with_char(...)` that currently pass `allowed=set(structural_pairs.keys())` and change those arguments to `allowed=allowed_structurals` to reuse the precomputed set.
If there are multiple such calls (e.g., for different delimiters or branches), ensure all are updated to use `allowed_structurals`.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| for match in re.finditer(pattern, content): | ||
| keyword_pos = match.start() | ||
| # Create a mapping to quickly look up the delimiter by its matched start string | ||
| delimiter_map = {d.start: d for d in keyword_delimiters} |
There was a problem hiding this comment.
issue (bug_risk): Handling duplicate keyword start strings may change behavior compared to the previous implementation.
Previously, each delimiter was processed independently with re.finditer, so duplicate start values would each be applied (possibly with different end behavior/metadata). With delimiter_map = {d.start: d for d in keyword_delimiters}, only the last delimiter for a given start is ever used. If duplicate start values are valid in this context, this changes semantics. Consider either enforcing unique start values up front and failing fast, or mapping each start to a list of delimiters to preserve the prior behavior.
| for match in combined_pattern.finditer(content): | ||
| matched_text = match.group(0) | ||
| delimiter = delimiter_map[matched_text] | ||
| keyword_pos = match.start() |
There was a problem hiding this comment.
suggestion (performance): The allowed set is rebuilt on every match and could be hoisted out of the loop.
This recreates set(structural_pairs.keys()) for every keyword occurrence. Since structural_pairs is constant in this method, compute allowed_structurals = set(structural_pairs.keys()) once before for match in combined_pattern.finditer(content): and pass allowed=allowed_structurals here to avoid repeated allocation in the hot inner loop.
Suggested implementation:
allowed_structurals = set(structural_pairs.keys())
for match in combined_pattern.finditer(content):
matched_text = match.group(0)
delimiter = delimiter_map[matched_text]
keyword_pos = match.start()Within the same method where this loop lives, find the call(s) to:
self._find_next_structural_with_char(...) that currently pass allowed=set(structural_pairs.keys()) and change those arguments to allowed=allowed_structurals to reuse the precomputed set.
If there are multiple such calls (e.g., for different delimiters or branches), ensure all are updated to use allowed_structurals.
There was a problem hiding this comment.
Pull request overview
Optimizes DelimiterChunker._match_keyword_delimiters by scanning the input text once with a combined keyword regex instead of running re.finditer per keyword, aiming to reduce chunking overhead on large files.
Changes:
- Combine all keyword-start strings into a single compiled regex and map matches back to delimiters.
- Add a guard to avoid compiling an empty keyword regex when keyword delimiters are filtered down to none.
- Fix a docstring spelling (“normalised” → “normalized”).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| start_strings = [d.start for d in keyword_delimiters] | ||
| combined_pattern = re.compile(rf"\b(?:{'|'.join(map(re.escape, start_strings))})\b") | ||
|
|
||
| for match in re.finditer(pattern, content): | ||
| keyword_pos = match.start() | ||
| # Create a mapping to quickly look up the delimiter by its matched start string | ||
| delimiter_map = {d.start: d for d in keyword_delimiters} | ||
|
|
||
| # Skip if keyword is inside a string or comment | ||
| if self._is_inside_string_or_comment(content, keyword_pos): | ||
| continue | ||
| for match in combined_pattern.finditer(content): | ||
| matched_text = match.group(0) | ||
| delimiter = delimiter_map[matched_text] | ||
| keyword_pos = match.start() |
There was a problem hiding this comment.
delimiter_map = {d.start: d ...} assumes keyword starts are unique. In this codebase they are not (e.g., STRUCT_PATTERN and TYPE_ALIAS_PATTERN both include start string "type" via families.py), so this change will drop one delimiter and change which kind/priority wins during overlap resolution. Consider mapping start -> list[Delimiter] (or selecting the highest-priority delimiter per start) and emitting matches for each delimiter associated with the matched start string to preserve prior behavior.
| # Optimization: Combine all keyword start strings into a single compiled regex pattern. | ||
| # This allows us to make a single pass over the content rather than iterating over | ||
| # `re.finditer` for each keyword delimiter individually, significantly reducing overhead. | ||
| start_strings = [d.start for d in keyword_delimiters] | ||
| combined_pattern = re.compile(rf"\b(?:{'|'.join(map(re.escape, start_strings))})\b") | ||
|
|
||
| for match in re.finditer(pattern, content): | ||
| keyword_pos = match.start() | ||
| # Create a mapping to quickly look up the delimiter by its matched start string | ||
| delimiter_map = {d.start: d for d in keyword_delimiters} |
There was a problem hiding this comment.
This optimization changes how keyword delimiters are associated to matches. There are currently unit tests for delimiter chunking behavior, but none appear to cover the case where multiple keyword delimiters share the same start (e.g., type being both STRUCT and TYPE_ALIAS in family patterns). Adding a test that asserts the resolved boundary kind/priority for such an input would help prevent regressions here.
Code Review - PR #281: Optimize keyword delimiter matchingSummaryThis PR implements a valuable performance optimization by consolidating multiple regex searches into a single pass. The core idea is sound and addresses a real O(K × N) → O(N) performance improvement. However, there are a few considerations to address. ✅ Strengths
🔍 Issues & Suggestions1. Performance issue: Set allocation in hot loop
|
Review of PR #281: Optimize keyword delimiter matchingThanks for this performance optimization! The approach of combining keyword delimiters into a single regex pattern is sound and should indeed provide O(N) performance instead of O(K × N). However, I've identified a critical correctness bug that needs to be addressed before merging. 🐛 Critical Bug: Dictionary Mapping Loses Duplicate Start StringsIssue Location: The new implementation uses a dictionary to map start strings to delimiter objects: ```python Problem: When multiple delimiters have the same `start` string but different semantic properties (kind, priority), the dictionary only keeps the last one, silently dropping the others. Real-world example: The codebase defines:
When the code encounters the keyword `type`, the old implementation would:
The new implementation will:
🔧 Suggested FixReplace the simple dictionary with a multi-value mapping: ```python Create a mapping to look up ALL delimiters by their matched start stringfrom collections import defaultdict Then update the matching loop to iterate through all matching delimiters: ```python ``` ✅ What's Good
📋 Additional Recommendations
🎯 VerdictRequest Changes - The correctness bug must be fixed before merging. The optimization approach is excellent, but the implementation needs to handle multiple delimiters per start string correctly. Test Coverage: The existing test file `tests/unit/engine/chunker/test_delimiter_patterns.py` doesn't appear to have specific tests for `_match_keyword_delimiters` behavior. Consider adding integration tests that verify "type" keyword matches both STRUCT and TYPE_ALIAS patterns. |
Code ReviewI've reviewed this PR and identified a critical bug in the optimization that needs to be addressed before merging. 🐛 Critical Issue: Duplicate Start String HandlingLocation: The optimization introduces a bug when multiple keyword delimiters share the same delimiter_map = {d.start: d for d in keyword_delimiters}Problem: When multiple delimiters have the same Concrete Example:
Both patterns include 🔧 Suggested FixThe delimiter map needs to support multiple delimiters per start string. Here are two approaches: Option 1: Map to list of delimiters from collections import defaultdict
delimiter_map = defaultdict(list)
for d in keyword_delimiters:
delimiter_map[d.start].append(d)
for match in combined_pattern.finditer(content):
matched_text = match.group(0)
matching_delimiters = delimiter_map[matched_text]
for delimiter in matching_delimiters:
keyword_pos = match.start()
# ... rest of the matching logicOption 2: Include all duplicates in regex and match all for match in combined_pattern.finditer(content):
matched_text = match.group(0)
matching_delimiters = [d for d in keyword_delimiters if d.start == matched_text]
for delimiter in matching_delimiters:
# ... matching logicOption 1 is more efficient as it pre-builds the lookup structure. ✅ Positive Aspects
📋 Testing RecommendationAs mentioned in the Copilot review comment, this PR should include a test that validates behavior when multiple keyword delimiters share the same def test_keyword_delimiters_with_duplicate_starts():
"""Test that multiple delimiters with same start string are all matched."""
# Test code that uses 'type' keyword which appears in both STRUCT and TYPE_ALIAS patterns
code = """
type MyStruct struct {
field string
}
type MyAlias = string
"""
# Verify that both STRUCT and TYPE_ALIAS boundaries are detected
# and resolved based on context/priority🎯 RecommendationRequest Changes - The bug must be fixed before merging. Once the duplicate start handling is corrected and appropriate tests are added, this optimization will be a valuable improvement to the codebase. 📊 Impact Assessment
Next Steps:
|
Code ReviewI've reviewed this PR and found a critical bug in the optimization that will cause incorrect behavior when there are multiple keyword delimiters with the same start string. Issue: Dictionary Comprehension Loses Duplicate Start StringsOn line 472 in delimiter_map = {d.start: d for d in keyword_delimiters}This dictionary comprehension will silently drop all but the last delimiter when multiple delimiters share the same
The map will only contain Delimiter 2, and Delimiter 1 will be completely lost. This breaks the original behavior where each delimiter was processed independently. The FixTo maintain correctness while keeping the performance optimization, you need to use a multi-value map. Here's the corrected approach: # Create a mapping to quickly look up delimiters by their matched start string
# Use a list to handle multiple delimiters with the same start string
from collections import defaultdict
delimiter_map: dict[str, list[Delimiter]] = defaultdict(list)
for d in keyword_delimiters:
delimiter_map[d.start].append(d)
for match in combined_pattern.finditer(content):
matched_text = match.group(0)
keyword_pos = match.start()
# Process ALL delimiters that match this start string
for delimiter in delimiter_map[matched_text]:
# Skip if keyword is inside a string or comment
if self._is_inside_string_or_comment(content, keyword_pos):
continue
# ... rest of the logicAdditional ObservationsPositives: Concerns: RecommendationRequest Changes - The duplicate key issue must be fixed before merging. The optimization is valuable, but correctness cannot be compromised. Once fixed, this will be an excellent performance improvement for keyword delimiter matching. |
|
I pulled this down locally to address the duplicate-keyword-start bug from the review. I don't have push access to The change preserves all delimiters that share the same keyword start instead of collapsing them to the last map entry, which restores the original per-delimiter behavior while keeping the single-pass regex optimization. If you'd rather keep everything on #281, feel free to cherry-pick |
|
closing in favor of #282 |
|
closing in favor of #282 |
Understood. Acknowledging that this work is now obsolete and stopping work on this task. |
1 similar comment
Understood. Acknowledging that this work is now obsolete and stopping work on this task. |
💡 What:
The
_match_keyword_delimitersmethod in the DelimiterChunker has been optimized. Instead of looping through each potential keyword delimiter and running a separatere.finditersearch across the entire document text, the code now combines all keyword start strings into a single compiled regex pattern using alternation(?:...)and maps matched text back to the delimiter objects.🎯 Why:$O(K \times N)$ regex evaluation time (where $K$ is the number of keywords and $N$ is the size of the text). By compiling them into a single pattern, the algorithm performs a single $O(N)$ pass, avoiding redundant full-text scans and context switching. An added guard clause ensures the optimized pattern does not compile an empty expression if the
Iterating
re.finditerfor each keyword results inkeyword_delimiterslist is unexpectedly empty.📊 Impact:
Significantly reduces parsing overhead for code chunking, especially on larger files with many potential keyword delimiters. Micro-benchmarking confirms the single-pass combined regex pattern reduces execution time compared to individual regex evaluation loops in extreme scenarios.
🔬 Measurement:
To verify the improvement, run
mise //:test tests/unit/engine/chunker/test_delimiter_patterns.pyto ensure matching behavior remains precisely equivalent while leveraging the new optimized compiled regex block.PR created automatically by Jules for task 7248057790755398651 started by @bashandbone
Summary by Sourcery
Optimize keyword delimiter matching in the delimiter chunker for improved performance while preserving existing behavior.
Enhancements: